
3670: perf: use aligned pointer reads for SparkUnsafeRow field accessors #46

Closed
martin-augment wants to merge 4 commits into main from pr-3670-2026-03-12-11-23-44

Conversation

@martin-augment
Owner

3670: To review by AI

SparkUnsafeRow field offsets are always 8-byte aligned (the JVM
guarantees 8-byte alignment on the base address, bitset_width is a
multiple of 8, and each field slot is 8 bytes). This means we can
safely use ptr::read() instead of the from_le_bytes(slice) pattern
for all typed accesses, avoiding slice creation and try_into overhead.

Move primitive accessor implementations out of the SparkUnsafeObject
trait defaults and into each concrete impl via a macro parameterized
on the read method:
- SparkUnsafeRow uses ptr::read() (aligned)
- SparkUnsafeArray uses ptr::read_unaligned() (may be unaligned when
  nested in a row's variable-length region)
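
The distinction above can be sketched in standalone Rust (illustrative names, not the PR's actual accessor code; the `Aligned` wrapper stands in for the JVM's 8-byte-aligned row buffer):

```rust
// An 8-byte-aligned backing buffer, modeling a Spark UnsafeRow's base address.
#[repr(align(8))]
pub struct Aligned(pub [u8; 16]);

/// Aligned read of the first i64 slot; sound only because #[repr(align(8))]
/// guarantees the base address is 8-byte aligned.
pub fn read_first_long(buf: &Aligned) -> i64 {
    unsafe { (buf.0.as_ptr() as *const i64).read() }
}

/// Unaligned read starting one byte into the buffer; read_unaligned is sound
/// at any offset, which is why the array accessors must use it.
pub fn read_long_at_offset1(buf: &Aligned) -> i64 {
    unsafe { (buf.0.as_ptr().add(1) as *const i64).read_unaligned() }
}

fn main() {
    let mut buf = Aligned([0u8; 16]);
    buf.0[..8].copy_from_slice(&42i64.to_ne_bytes());
    assert_eq!(read_first_long(&buf), 42);

    buf.0[1..9].copy_from_slice(&7i64.to_ne_bytes());
    assert_eq!(read_long_at_offset1(&buf), 7);
    println!("ok");
}
```

Calling `read()` on the offset-1 pointer instead would be undefined behavior, which is exactly what Miri flags in the test fix below.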

Also switch is_null_at/set_not_null_at in SparkUnsafeRow from
read_unaligned/write_unaligned to aligned read/write, since the null
bitset is always at 8-byte aligned offsets within the row.
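
The bitset addressing that makes those offsets multiples of 8 can be modeled in safe Rust (a simplified sketch of the logic, not the PR's pointer-based code):

```rust
/// Simplified model: the null bitset is an array of 64-bit words at the row
/// base. Field `index` maps to word `index / 64` and bit `index % 64`, so the
/// word's byte offset, (index >> 6) * 8, is always a multiple of 8.
fn is_null_at(bitset: &[u64], index: usize) -> bool {
    let word = bitset[index >> 6]; // word offset in bytes: (index >> 6) << 3
    let mask = 1u64 << (index & 0x3f); // bit within the word
    word & mask != 0
}

fn main() {
    let bitset = [0b1010u64]; // fields 1 and 3 marked null
    assert!(!is_null_at(&bitset, 0));
    assert!(is_null_at(&bitset, 1));
    assert!(is_null_at(&bitset, 3));
    println!("ok");
}
```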
The test_append_null_struct_field_to_struct_builder test used a plain
[u8; 16] stack buffer with no alignment guarantee. Since is_null_at
performs aligned i64 reads, Miri flags this as undefined behavior when
the buffer lands at a non-8-byte-aligned address.

Wrap the buffer in a #[repr(align(8))] struct to match the alignment
that real Spark UnsafeRow data always has from JVM memory.
@coderabbitai

coderabbitai bot commented Mar 12, 2026

Walkthrough

The changes modify Spark unsafe object accessors by introducing macro-based primitive getter implementations. In list.rs, a macro is added to ensure unaligned-aware reads for SparkUnsafeArray. In row.rs, nine new public primitive getter methods are introduced (get_boolean, get_byte, get_short, get_int, get_long, get_float, get_double, get_date, get_timestamp), inline implementations are replaced with macro-generated versions, and alignment assumptions are updated to use aligned reads instead of unaligned reads for SparkUnsafeRow primitive access.


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors how primitive fields are accessed within Spark UnsafeRow and UnsafeArray structures in Rust, improving performance and correctness by explicitly handling memory alignment. It introduces a macro to generate accessor methods that use either aligned ptr::read() for SparkUnsafeRow (which guarantees 8-byte alignment) or ptr::read_unaligned() for SparkUnsafeArray (which may not be aligned). This change optimizes memory access patterns based on the underlying data structure's alignment properties.

Highlights

  • Refactored Primitive Field Accessors: Introduced a new macro impl_primitive_accessors to generate type-specific field accessor methods for SparkUnsafeObject, moving implementations from direct trait methods into this macro.
  • Optimized SparkUnsafeRow Access: Configured SparkUnsafeRow to use aligned ptr::read() for field access, leveraging its guaranteed 8-byte alignment for improved performance and correctness.
  • Safe SparkUnsafeArray Access: Configured SparkUnsafeArray to use ptr::read_unaligned() for field access, accounting for its potential unaligned base address when nested within a row's variable-length region.
  • Bitset Operation Alignment: Updated is_null_at and set_not_null_at methods in SparkUnsafeRow to use aligned ptr::read() and ptr::write() for bitset manipulation, clarifying the alignment guarantee.
  • Test Case Alignment Improvement: Modified a test case to ensure SparkUnsafeRow is initialized with an 8-byte aligned buffer, reflecting real-world Spark UnsafeRow layouts more accurately.
Changelog
  • native/core/src/execution/shuffle/spark_unsafe/list.rs
    • Imported the new impl_primitive_accessors macro.
    • Applied impl_primitive_accessors!(read_unaligned) to SparkUnsafeArray to ensure safe access for potentially unaligned data.
  • native/core/src/execution/shuffle/spark_unsafe/row.rs
    • Added new primitive accessor methods (e.g., get_boolean, get_int, get_long, get_date, get_timestamp) to the SparkUnsafeObject trait definition.
    • Removed the previous inline implementations of these primitive accessor methods from the SparkUnsafeObject trait.
    • Defined the impl_primitive_accessors macro, which generates the actual implementations of the primitive accessor methods using either ptr::read() or ptr::read_unaligned().
    • Applied impl_primitive_accessors!(read) to SparkUnsafeRow, utilizing aligned reads due to its guaranteed 8-byte alignment.
    • Updated the is_null_at and set_not_null_at methods to use aligned ptr::read() and ptr::write() respectively, with comments clarifying the alignment guarantee.
    • Modified a test case (test::test_append_struct_with_null_field) to initialize SparkUnsafeRow with an 8-byte aligned buffer.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a performance optimization by using aligned pointer reads for SparkUnsafeRow field accessors, and unaligned reads for SparkUnsafeArray. A macro is used to generate the accessor methods, which is a clean approach to handle the different alignment guarantees. While the logic for choosing between aligned and unaligned reads is sound, the implementation of the primitive accessors overlooks data endianness. This introduces a critical correctness bug that can lead to silent data corruption on big-endian architectures. I have provided a detailed comment and a code suggestion to fix this issue.

Comment on lines +175 to +255
macro_rules! impl_primitive_accessors {
    ($read_method:ident) => {
        #[inline]
        fn get_boolean(&self, index: usize) -> bool {
            let addr = self.get_element_offset(index, 1);
            debug_assert!(
                !addr.is_null(),
                "get_boolean: null pointer at index {index}"
            );
            // SAFETY: addr points to valid element data within the row/array region.
            unsafe { *addr != 0 }
        }

        #[inline]
        fn get_byte(&self, index: usize) -> i8 {
            let addr = self.get_element_offset(index, 1);
            debug_assert!(!addr.is_null(), "get_byte: null pointer at index {index}");
            // SAFETY: addr points to valid element data (1 byte) within the row/array region.
            unsafe { *(addr as *const i8) }
        }

        #[inline]
        fn get_short(&self, index: usize) -> i16 {
            let addr = self.get_element_offset(index, 2) as *const i16;
            debug_assert!(!addr.is_null(), "get_short: null pointer at index {index}");
            // SAFETY: addr points to valid element data (2 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_int(&self, index: usize) -> i32 {
            let addr = self.get_element_offset(index, 4) as *const i32;
            debug_assert!(!addr.is_null(), "get_int: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_long(&self, index: usize) -> i64 {
            let addr = self.get_element_offset(index, 8) as *const i64;
            debug_assert!(!addr.is_null(), "get_long: null pointer at index {index}");
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_float(&self, index: usize) -> f32 {
            let addr = self.get_element_offset(index, 4) as *const f32;
            debug_assert!(!addr.is_null(), "get_float: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_double(&self, index: usize) -> f64 {
            let addr = self.get_element_offset(index, 8) as *const f64;
            debug_assert!(!addr.is_null(), "get_double: null pointer at index {index}");
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_date(&self, index: usize) -> i32 {
            let addr = self.get_element_offset(index, 4) as *const i32;
            debug_assert!(!addr.is_null(), "get_date: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }

        #[inline]
        fn get_timestamp(&self, index: usize) -> i64 {
            let addr = self.get_element_offset(index, 8) as *const i64;
            debug_assert!(
                !addr.is_null(),
                "get_timestamp: null pointer at index {index}"
            );
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            unsafe { addr.$read_method() }
        }
    };
}

critical

The current implementation of primitive accessors has a correctness issue on big-endian systems. ptr::read() and ptr::read_unaligned() interpret bytes using the native endianness of the CPU. However, Spark's Unsafe format is little-endian. This will lead to incorrect data being read on big-endian architectures, causing silent data corruption.

To ensure portability and correctness, you should explicitly handle the little-endian format. For integer types, you can use T::from_le(). For floating-point types, you should read the data as an integer of the same size, convert its endianness, and then convert the bits to a float (e.g., f32::from_bits(u32::from_le(...))).

macro_rules! impl_primitive_accessors {
    ($read_method:ident) => {
        #[inline]
        fn get_boolean(&self, index: usize) -> bool {
            let addr = self.get_element_offset(index, 1);
            debug_assert!(
                !addr.is_null(),
                "get_boolean: null pointer at index {index}"
            );
            // SAFETY: addr points to valid element data within the row/array region.
            unsafe { *addr != 0 }
        }

        #[inline]
        fn get_byte(&self, index: usize) -> i8 {
            let addr = self.get_element_offset(index, 1);
            debug_assert!(!addr.is_null(), "get_byte: null pointer at index {index}");
            // SAFETY: addr points to valid element data (1 byte) within the row/array region.
            unsafe { *(addr as *const i8) }
        }

        #[inline]
        fn get_short(&self, index: usize) -> i16 {
            let addr = self.get_element_offset(index, 2) as *const i16;
            debug_assert!(!addr.is_null(), "get_short: null pointer at index {index}");
            // SAFETY: addr points to valid element data (2 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i16::from_le(addr.$read_method()) }
        }

        #[inline]
        fn get_int(&self, index: usize) -> i32 {
            let addr = self.get_element_offset(index, 4) as *const i32;
            debug_assert!(!addr.is_null(), "get_int: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i32::from_le(addr.$read_method()) }
        }

        #[inline]
        fn get_long(&self, index: usize) -> i64 {
            let addr = self.get_element_offset(index, 8) as *const i64;
            debug_assert!(!addr.is_null(), "get_long: null pointer at index {index}");
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i64::from_le(addr.$read_method()) }
        }

        #[inline]
        fn get_float(&self, index: usize) -> f32 {
            let addr = self.get_element_offset(index, 4) as *const u32;
            debug_assert!(!addr.is_null(), "get_float: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian. Read as u32, convert, then transmute.
            unsafe { f32::from_bits(u32::from_le(addr.$read_method())) }
        }

        #[inline]
        fn get_double(&self, index: usize) -> f64 {
            let addr = self.get_element_offset(index, 8) as *const u64;
            debug_assert!(!addr.is_null(), "get_double: null pointer at index {index}");
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian. Read as u64, convert, then transmute.
            unsafe { f64::from_bits(u64::from_le(addr.$read_method())) }
        }

        #[inline]
        fn get_date(&self, index: usize) -> i32 {
            let addr = self.get_element_offset(index, 4) as *const i32;
            debug_assert!(!addr.is_null(), "get_date: null pointer at index {index}");
            // SAFETY: addr points to valid element data (4 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i32::from_le(addr.$read_method()) }
        }

        #[inline]
        fn get_timestamp(&self, index: usize) -> i64 {
            let addr = self.get_element_offset(index, 8) as *const i64;
            debug_assert!(
                !addr.is_null(),
                "get_timestamp: null pointer at index {index}"
            );
            // SAFETY: addr points to valid element data (8 bytes) within the row/array region.
            // Spark's Unsafe format is little-endian, so we must convert from LE to native.
            unsafe { i64::from_le(addr.$read_method()) }
        }
    };
}


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


/// Returns double value at the given index of the object.
#[inline]
fn get_double(&self, index: usize) -> f64 {
let addr = self.get_element_offset(index, 8);

Aligned read on potentially unaligned nested struct from array

High Severity

The trait method get_struct returns a SparkUnsafeRow when called on a SparkUnsafeArray. Since SparkUnsafeRow uses aligned ptr::read() for all typed accesses, but the PR's own documentation states that SparkUnsafeArray base addresses "may be unaligned when nested within a row's variable-length region," the resulting SparkUnsafeRow from array.get_struct(idx, ...) would inherit that misalignment. Calling is_null_at, get_int, get_long, etc. on such a row invokes aligned ptr::read() on potentially unaligned memory, which is undefined behavior. This path is exercised in list.rs when arrays contain struct elements.


@claude
Copy link
Copy Markdown

claude bot commented Mar 12, 2026

Code Review: perf: use aligned pointer reads for SparkUnsafeRow field accessors

Overall this is a well-motivated and well-documented performance optimization. The alignment reasoning is sound and the macro approach cleanly handles the two different cases.


Correctness: Endianness assumption

This is the most important issue. The old code used from_le_bytes(), which explicitly handles little-endian byte order. The new ptr::read() / ptr::read_unaligned() calls read in native byte order.

Spark UnsafeRow stores data in little-endian byte order. On x86_64 (LE) this is equivalent, but on a big-endian target (IBM Z or BE-ARM) the primitive reads would silently produce wrong values. Meanwhile SparkUnsafeArray::new still correctly reads the element count via i64::from_le_bytes, creating an inconsistency between the two.

If Comet only targets little-endian architectures this is fine, but it should be documented. Consider a module-level comment noting the LE assumption, or a compile-time check (e.g. const _: () = assert!(cfg!(target_endian = "little"), "...");) to reject non-LE builds explicitly.
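
The suggested compile-time guard could look like this (a sketch of the reviewer's idea; either form rejects big-endian builds rather than silently misreading data):

```rust
// Sketch: refuse to build on big-endian targets, where native-endian
// ptr::read() would misinterpret Spark's little-endian layout.
#[cfg(not(target_endian = "little"))]
compile_error!("SparkUnsafeRow accessors assume a little-endian target");

// Equivalent const-assertion form (const panics with a message are stable
// since Rust 1.57):
const _: () = assert!(
    cfg!(target_endian = "little"),
    "Spark Unsafe format is little-endian"
);

fn main() {
    // On a little-endian target this compiles and runs; on big-endian the
    // build fails with the message above.
    println!("little-endian target");
}
```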


Minor: $read_method silently ignored for get_boolean and get_byte

Single-byte reads have no alignment requirement, so ignoring $read_method for these two methods is correct. However, a macro parameter being silently unused for two of its nine generated methods can confuse future readers. A short inline comment (e.g. // 1-byte reads are alignment-safe; $read_method is intentionally unused here) would clarify the intent.


Minor: Alignment claim wording

The comment "JVM guarantee" for 8-byte base-address alignment is accurate in practice for HotSpot on x86_64, but the Java Language Specification does not formally require byte[] to be 8-byte aligned -- it is a HotSpot implementation detail. Since this code only runs in that specific JVM context via JNI it is fine in practice, but "HotSpot implementation guarantee" would be slightly more precise than "JVM guarantee."


SparkUnsafeArray::is_null_at consistency

SparkUnsafeArray::is_null_at (list.rs:161) correctly continues using read_unaligned for the null bitset since the array can be at an unaligned offset. This is consistent with impl_primitive_accessors!(read_unaligned) -- good.


Test fix is correct

The #[repr(align(8))] wrapper in the Miri test is the right approach. Miri correctly flagged the plain [u8; 16] buffer as UB once is_null_at switched to aligned read(), and this fix matches the real Spark UnsafeRow layout guarantee.


Summary

  • Core optimization: correct and well-reasoned
  • Macro design: clean; read vs read_unaligned parameterization is a good pattern
  • Endianness: needs documentation or a compile-time guard (most important)
  • get_boolean/get_byte ignoring $read_method: minor -- worth a comment
  • Test fix: correct

The endianness point is the one I would ask to be addressed before merging, even if only as a module-level comment or a compile_error! for non-LE targets.

@augmentcode

augmentcode bot commented Mar 12, 2026

🤖 Augment PR Summary

Summary: Improves Spark UnsafeRow/UnsafeArray primitive field access by switching from byte-slice decoding to typed pointer reads, using aligned reads where layout guarantees alignment.

Changes:

  • Refactors `SparkUnsafeObject` primitive getters into per-type implementations via `impl_primitive_accessors!`.
  • Uses aligned `ptr::read`/`write` for `SparkUnsafeRow` null-bitset and field access; uses `read_unaligned` for `SparkUnsafeArray`.
  • Exports the new macro for reuse and wires it into `SparkUnsafeArray`.
  • Updates the struct-null test to use an 8-byte aligned backing buffer to match Spark’s layout assumptions.

Technical Notes: Adds alignment rationale to docs/comments to justify the aligned vs unaligned access strategy.



@augmentcode augmentcode bot left a comment


Review completed. 2 suggestions posted.


let addr = self.get_element_offset(index, 4) as *const i32;
debug_assert!(!addr.is_null(), "get_int: null pointer at index {index}");
// SAFETY: addr points to valid element data (4 bytes) within the row/array region.
unsafe { addr.$read_method() }

ptr::$read_method() reads in native endianness, but Spark UnsafeRow/UnsafeArray encodes primitives as little-endian (e.g., SparkUnsafeArray::new uses from_le_bytes). If this crate ever targets big-endian platforms, these accessors would decode incorrectly unless you explicitly guard/document little-endian-only support.

Severity: medium


  let mask: i64 = 1i64 << (index & 0x3f);
  let word_offset = (self.row_addr + (((index >> 6) as i64) << 3)) as *const i64;
- let word: i64 = word_offset.read_unaligned();
+ let word: i64 = word_offset.read();

Switching to word_offset.read() assumes row_addr is 8-byte aligned, but point_to_slice can point at arbitrary &[u8] (and Vec<u8> is not guaranteed 8-byte aligned by its type). If an unaligned slice is ever passed in, this becomes UB; it may be worth enforcing/documenting the alignment precondition at the API boundary.

Severity: high

Other Locations
  • native/core/src/execution/shuffle/spark_unsafe/row.rs:369
  • native/core/src/execution/shuffle/spark_unsafe/row.rs:370



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@native/core/src/execution/shuffle/spark_unsafe/row.rs`:
- Around line 280-283: The safe constructor
SparkUnsafeRow::point_to_slice(&[u8]) currently accepts unaligned slices but
downstream helpers generated by impl_primitive_accessors!(read) and methods
is_null_at() / set_not_null_at() use aligned ptr::read()/write(), so add a
runtime alignment check in point_to_slice that asserts the row base pointer is
8-byte aligned (e.g., check (row_addr as usize) % 8 == 0 or use
is_aligned_to(8)) and panic with a clear message if not; this enforces the
8-byte alignment invariant at the safe entry point and prevents undefined
behavior when benchmarks pass unaligned Vec<u8> buffers.


📥 Commits

Reviewing files that changed from the base of the PR and between d5841a1 and 8bc5761.

📒 Files selected for processing (2)
  • native/core/src/execution/shuffle/spark_unsafe/list.rs
  • native/core/src/execution/shuffle/spark_unsafe/row.rs

Comment on lines +280 to +283
// SparkUnsafeRow field offsets are always 8-byte aligned: the base address is 8-byte
// aligned (JVM guarantee), bitset_width is a multiple of 8, and each field slot is
// 8 bytes. This means we can safely use aligned ptr::read() for all typed accesses.
impl_primitive_accessors!(read);

⚠️ Potential issue | 🔴 Critical

Enforce the 8-byte row alignment invariant at the safe entry point.

impl_primitive_accessors!(read) and the aligned ptr::read()/ptr::write() calls in is_null_at() and set_not_null_at() require row_addr to be 8-byte aligned, but SparkUnsafeRow::point_to_slice(&[u8]) accepts unaligned slices. The benchmark passes Vec<u8> buffers (alignment = 1) to point_to_slice(), which later causes undefined behavior when aligned pointer operations dereference misaligned addresses.

Add an alignment assertion at the safe boundary to catch this at runtime:

Suggested guard
 pub fn point_to_slice(&mut self, slice: &[u8]) {
+    assert!(
+        slice.is_empty()
+            || (slice.as_ptr() as usize) % std::mem::align_of::<i64>() == 0,
+        "SparkUnsafeRow::point_to_slice requires an 8-byte aligned buffer"
+    );
     self.row_addr = slice.as_ptr() as i64;
     self.row_size = slice.len() as i32;
 }

The test at line 1695 is safe because it uses #[repr(align(8))], but the benchmark at line 236 passes unaligned Vec<u8> slices and will panic or exhibit undefined behavior without this guard.

